Load packages.
library(here) # A Simpler Way to Find Your Files, CRAN
library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN
library(dada2) # Accurate, high-resolution sample inference from amplicon data, Bioconductor
library(plyr) # Tools for Splitting, Applying and Combining Data, CRAN
library(DT) # A Wrapper of the JavaScript Library 'DataTables', CRAN
library(plotly) # Create Interactive Web Graphics via 'plotly.js', CRAN
library(biomformat) # An interface package for the BIOM file format, Bioconductor
# Set seed
set.seed(1910)
Set the path to the fastq files:
path <- here::here("data/raw/casava-18-paired-end-demultiplexed-run2")
head(list.files(path))
## [1] "AqFl1-001_S1_L001_R1_001.fastq.gz" "AqFl1-001_S1_L001_R2_001.fastq.gz"
## [3] "AqFl1-004_S12_L001_R1_001.fastq.gz" "AqFl1-004_S12_L001_R2_001.fastq.gz"
## [5] "AqFl1-006_S23_L001_R1_001.fastq.gz" "AqFl1-006_S23_L001_R2_001.fastq.gz"
Now we read in the names of the fastq files, and perform some string manipulation to get matched lists of the forward and reverse fastq files.
# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq.gz and SAMPLENAME_R2_001.fastq.gz
fnFs <- sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq.gz
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
## [1] "AqFl1-001" "AqFl1-004" "AqFl1-006" "AqFl1-008" "AqFl1-010" "AqFl1-012"
We start by visualizing the quality profiles of the forward reads:
plotQualityProfile(fnFs[1:2]) +
scale_x_continuous(limits = c(0, 300), breaks = 0:6*50)
In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quantiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).
The forward reads are of good quality. We generally trim the last few nucleotides to avoid the less well-controlled errors that can arise there. Based on these profiles, we will truncate the forward reads at position 290, trimming the last 10 nucleotides; beyond that, no additional trimming appears necessary.
Now we visualize the quality profile of the reverse reads:
plotQualityProfile(fnRs[1:2]) +
scale_x_continuous(limits = c(0, 300), breaks = 0:6*50)
The reverse reads are of noticeably worse quality, which is common in Illumina sequencing. This isn’t too worrisome, as DADA2 incorporates quality information into its error model, making the algorithm robust to lower-quality sequence, but trimming where the average quality crashes will improve the algorithm’s sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 248, where the quality distribution crashes.
Considerations: The reads must still overlap after truncation in order to merge them later! For primer sets with less overlap, like the V1-V2 region used in the present study, truncLen must be large enough to maintain 20 + biological.length.variation nucleotides of overlap between the reads.
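The overlap requirement can be sanity-checked arithmetically before filtering. A minimal sketch, assuming amplicon lengths (excluding primers) of roughly 300–340 bp for this V1-V2 primer set; the helper name and the amplicon length are illustrative, not part of the pipeline:

```r
# Expected merge overlap: usable bases of both reads after primer removal,
# minus the amplicon length (excluding primers). The 330 bp amplicon length
# is an assumed example for the V1-V2 region.
expected_overlap <- function(trunc_len, trim_left, amplicon_len) {
  usable <- trunc_len - trim_left  # per-read bases left after trimming
  sum(usable) - amplicon_len
}

expected_overlap(trunc_len = c(290, 248), trim_left = c(20, 18),
                 amplicon_len = 330)  # 170 bases of overlap
```

With truncLen = c(290, 248) and trimLeft = c(20, 18), even a 340 bp amplicon would still leave 160 bases of overlap, comfortably above the merging minimum.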
Assign filenames for the filtered fastq files.
# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names
We’ll use standard filtering parameters: maxN = 0 (DADA2 requires no Ns), truncQ = 2, rm.phix = TRUE and maxEE = 2. The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores. We’ll also trim off the primer sequences from the forward and reverse reads by setting trimLeft = c(20, 18). Trimming and filtering are performed on paired reads jointly, i.e. both reads must pass the filter for the pair to pass.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(20, 18),
truncLen = c(290,248), maxN = 0, maxEE = c(2,2), truncQ = 2,
rm.phix = TRUE, compress = TRUE, multithread = TRUE)
head(out)
## reads.in reads.out
## AqFl1-001_S1_L001_R1_001.fastq.gz 61001 50347
## AqFl1-004_S12_L001_R1_001.fastq.gz 78819 66715
## AqFl1-006_S23_L001_R1_001.fastq.gz 66665 48526
## AqFl1-008_S34_L001_R1_001.fastq.gz 34873 27090
## AqFl1-010_S45_L001_R1_001.fastq.gz 64724 43901
## AqFl1-012_S56_L001_R1_001.fastq.gz 74970 61428
Considerations: The standard filtering parameters are starting points, not set in stone. If speeding up downstream computation is needed, consider tightening maxEE. If too few reads are passing the filter, consider relaxing maxEE, perhaps especially on the reverse reads (e.g. maxEE = c(2, 5)), and reducing the truncLen to remove low-quality tails. Remember though, when choosing truncLen for paired-end reads we must maintain overlap after truncation in order to merge them later.
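The “expected errors” of a read follow directly from the definition of the Phred score: a base with quality Q has error probability 10^(-Q/10), and maxEE filters on the sum of these probabilities over the read. A minimal sketch with made-up quality vectors:

```r
# Expected errors of a read = sum of per-base error probabilities,
# where Phred quality Q implies an error probability of 10^(-Q/10).
expected_errors <- function(q) sum(10^(-q / 10))

expected_errors(rep(30, 10))   # a uniform-Q30 10-base read: 0.01 expected errors
expected_errors(rep(20, 250))  # a uniform-Q20 250-base read: 2.5, failing maxEE = 2
```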
The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
errF <- learnErrors(filtFs, multithread = TRUE)
## 106911360 total bases in 395968 reads from 8 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread = TRUE)
## 109339700 total bases in 475390 reads from 10 samples will be used for learning the error rates.
It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:
plotErrors(errF, nominalQ = TRUE)
plotErrors(errR, nominalQ = TRUE)
The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.
We are now ready to apply the core sample inference algorithm to the data. By default, the dada function processes each sample independently, which removes singletons within each sample. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. The dada2 package offers two types of pooling. dada(..., pool=TRUE) performs standard pooled processing, in which all samples are pooled together for sample inference. dada(..., pool="pseudo") performs pseudo-pooling, in which samples are processed independently after sharing information between samples, approximating pooled sample inference in linear time. Here, we use the default method (pool = FALSE).
dadaFs <- dada(filtFs, err = errF, pool = FALSE, multithread = TRUE)
## Sample 1 - 50347 reads in 7247 unique sequences.
## Sample 2 - 66715 reads in 11025 unique sequences.
## Sample 3 - 48526 reads in 6067 unique sequences.
## Sample 4 - 27090 reads in 3773 unique sequences.
## Sample 5 - 43901 reads in 5699 unique sequences.
## Sample 6 - 61428 reads in 10371 unique sequences.
## Sample 7 - 53666 reads in 7904 unique sequences.
## Sample 8 - 44295 reads in 7159 unique sequences.
## Sample 9 - 37892 reads in 5714 unique sequences.
## Sample 10 - 41530 reads in 5211 unique sequences.
## Sample 11 - 50938 reads in 9048 unique sequences.
## Sample 12 - 18992 reads in 2722 unique sequences.
## Sample 13 - 31110 reads in 4118 unique sequences.
## Sample 14 - 46345 reads in 6949 unique sequences.
## Sample 15 - 45224 reads in 7975 unique sequences.
## Sample 16 - 34068 reads in 5004 unique sequences.
## Sample 17 - 44920 reads in 7029 unique sequences.
## Sample 18 - 51849 reads in 5799 unique sequences.
## Sample 19 - 46312 reads in 5707 unique sequences.
## Sample 20 - 26968 reads in 3651 unique sequences.
## Sample 21 - 33793 reads in 3963 unique sequences.
## Sample 22 - 60707 reads in 9424 unique sequences.
## Sample 23 - 58846 reads in 8897 unique sequences.
## Sample 24 - 56827 reads in 6651 unique sequences.
## Sample 25 - 31090 reads in 4063 unique sequences.
## Sample 26 - 58156 reads in 9390 unique sequences.
## Sample 27 - 53463 reads in 10986 unique sequences.
## Sample 28 - 20146 reads in 2941 unique sequences.
## Sample 29 - 25205 reads in 4083 unique sequences.
## Sample 30 - 52887 reads in 8614 unique sequences.
## Sample 31 - 7936 reads in 1730 unique sequences.
## Sample 32 - 52384 reads in 8129 unique sequences.
## Sample 33 - 46222 reads in 7401 unique sequences.
## Sample 34 - 52519 reads in 5984 unique sequences.
## Sample 35 - 44623 reads in 5939 unique sequences.
## Sample 36 - 41098 reads in 6468 unique sequences.
## Sample 37 - 66723 reads in 9488 unique sequences.
## Sample 38 - 57559 reads in 8651 unique sequences.
## Sample 39 - 35697 reads in 4505 unique sequences.
## Sample 40 - 33292 reads in 5034 unique sequences.
## Sample 41 - 33312 reads in 4583 unique sequences.
## Sample 42 - 44021 reads in 5828 unique sequences.
## Sample 43 - 67734 reads in 14235 unique sequences.
## Sample 44 - 33348 reads in 5428 unique sequences.
## Sample 45 - 35883 reads in 4619 unique sequences.
## Sample 46 - 54997 reads in 8812 unique sequences.
## Sample 47 - 35530 reads in 4840 unique sequences.
## Sample 48 - 72532 reads in 13777 unique sequences.
## Sample 49 - 32613 reads in 3736 unique sequences.
## Sample 50 - 60495 reads in 6514 unique sequences.
## Sample 51 - 69810 reads in 8404 unique sequences.
## Sample 52 - 31441 reads in 4383 unique sequences.
## Sample 53 - 58706 reads in 6695 unique sequences.
## Sample 54 - 65226 reads in 7650 unique sequences.
## Sample 55 - 56008 reads in 6595 unique sequences.
## Sample 56 - 68755 reads in 7095 unique sequences.
## Sample 57 - 54351 reads in 12018 unique sequences.
## Sample 58 - 47278 reads in 7152 unique sequences.
## Sample 59 - 59153 reads in 14769 unique sequences.
## Sample 60 - 30834 reads in 4319 unique sequences.
## Sample 61 - 61041 reads in 12778 unique sequences.
## Sample 62 - 49320 reads in 6913 unique sequences.
## Sample 63 - 36525 reads in 4632 unique sequences.
## Sample 64 - 61309 reads in 12372 unique sequences.
## Sample 65 - 29816 reads in 7231 unique sequences.
## Sample 66 - 22233 reads in 2512 unique sequences.
## Sample 67 - 41813 reads in 14870 unique sequences.
## Sample 68 - 21879 reads in 11091 unique sequences.
## Sample 69 - 5 reads in 5 unique sequences.
## Sample 70 - 4 reads in 4 unique sequences.
## Sample 71 - 1736 reads in 507 unique sequences.
## Sample 72 - 1 reads in 1 unique sequences.
## Sample 73 - 8397 reads in 1029 unique sequences.
## Sample 74 - 4599 reads in 694 unique sequences.
dadaRs <- dada(filtRs, err = errR, pool = FALSE, multithread = TRUE)
## Sample 1 - 50347 reads in 11056 unique sequences.
## Sample 2 - 66715 reads in 14761 unique sequences.
## Sample 3 - 48526 reads in 9286 unique sequences.
## Sample 4 - 27090 reads in 6339 unique sequences.
## Sample 5 - 43901 reads in 9959 unique sequences.
## Sample 6 - 61428 reads in 14346 unique sequences.
## Sample 7 - 53666 reads in 12019 unique sequences.
## Sample 8 - 44295 reads in 9374 unique sequences.
## Sample 9 - 37892 reads in 8917 unique sequences.
## Sample 10 - 41530 reads in 8496 unique sequences.
## Sample 11 - 50938 reads in 12090 unique sequences.
## Sample 12 - 18992 reads in 5027 unique sequences.
## Sample 13 - 31110 reads in 7367 unique sequences.
## Sample 14 - 46345 reads in 10815 unique sequences.
## Sample 15 - 45224 reads in 11855 unique sequences.
## Sample 16 - 34068 reads in 7685 unique sequences.
## Sample 17 - 44920 reads in 10879 unique sequences.
## Sample 18 - 51849 reads in 10231 unique sequences.
## Sample 19 - 46312 reads in 9772 unique sequences.
## Sample 20 - 26968 reads in 6741 unique sequences.
## Sample 21 - 33793 reads in 7604 unique sequences.
## Sample 22 - 60707 reads in 14440 unique sequences.
## Sample 23 - 58846 reads in 13872 unique sequences.
## Sample 24 - 56827 reads in 11569 unique sequences.
## Sample 25 - 31090 reads in 6476 unique sequences.
## Sample 26 - 58156 reads in 12726 unique sequences.
## Sample 27 - 53463 reads in 13124 unique sequences.
## Sample 28 - 20146 reads in 5132 unique sequences.
## Sample 29 - 25205 reads in 6779 unique sequences.
## Sample 30 - 52887 reads in 12009 unique sequences.
## Sample 31 - 7936 reads in 4629 unique sequences.
## Sample 32 - 52384 reads in 10719 unique sequences.
## Sample 33 - 46222 reads in 10608 unique sequences.
## Sample 34 - 52519 reads in 9381 unique sequences.
## Sample 35 - 44623 reads in 8361 unique sequences.
## Sample 36 - 41098 reads in 9688 unique sequences.
## Sample 37 - 66723 reads in 14974 unique sequences.
## Sample 38 - 57559 reads in 12533 unique sequences.
## Sample 39 - 35697 reads in 8936 unique sequences.
## Sample 40 - 33292 reads in 7771 unique sequences.
## Sample 41 - 33312 reads in 7525 unique sequences.
## Sample 42 - 44021 reads in 9338 unique sequences.
## Sample 43 - 67734 reads in 16903 unique sequences.
## Sample 44 - 33348 reads in 8638 unique sequences.
## Sample 45 - 35883 reads in 8299 unique sequences.
## Sample 46 - 54997 reads in 12796 unique sequences.
## Sample 47 - 35530 reads in 7930 unique sequences.
## Sample 48 - 72532 reads in 16710 unique sequences.
## Sample 49 - 32613 reads in 6073 unique sequences.
## Sample 50 - 60495 reads in 10312 unique sequences.
## Sample 51 - 69810 reads in 12339 unique sequences.
## Sample 52 - 31441 reads in 7240 unique sequences.
## Sample 53 - 58706 reads in 10572 unique sequences.
## Sample 54 - 65226 reads in 11885 unique sequences.
## Sample 55 - 56008 reads in 9935 unique sequences.
## Sample 56 - 68755 reads in 10657 unique sequences.
## Sample 57 - 54351 reads in 16060 unique sequences.
## Sample 58 - 47278 reads in 10101 unique sequences.
## Sample 59 - 59153 reads in 19116 unique sequences.
## Sample 60 - 30834 reads in 7400 unique sequences.
## Sample 61 - 61041 reads in 17535 unique sequences.
## Sample 62 - 49320 reads in 11502 unique sequences.
## Sample 63 - 36525 reads in 7700 unique sequences.
## Sample 64 - 61309 reads in 15856 unique sequences.
## Sample 65 - 29816 reads in 7884 unique sequences.
## Sample 66 - 22233 reads in 3662 unique sequences.
## Sample 67 - 41813 reads in 15933 unique sequences.
## Sample 68 - 21879 reads in 11290 unique sequences.
## Sample 69 - 5 reads in 5 unique sequences.
## Sample 70 - 4 reads in 4 unique sequences.
## Sample 71 - 1736 reads in 686 unique sequences.
## Sample 72 - 1 reads in 1 unique sequences.
## Sample 73 - 8397 reads in 1787 unique sequences.
## Sample 74 - 4599 reads in 1097 unique sequences.
Inspect the dada-class object
dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 283 sequence variants were inferred from 7247 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments).
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
## sequence
## 1 GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGCTGCTTTTTCCACCGAAGCTTGCTTCACCGGAAAAAGCGGAGTGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCTTCAGCGGGGGATAACACTTGGAAACAGGTGCTAATACCGCATAGGATTTCTGTTCGCATGAACGGAGAAGGAAAGACGGCGTAAGCTGTCACTGAAGGATGGACCCGCGGTGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAATGATGCATAGCCGACCTGAGAGGGTAATCGGCCACACTGGGACTGAGACACGGCCCAG
## 2 GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCCTCAGCTTGCTGGGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCTTGCACTCTGGGATAAGCTTGGGAAACTGGGTCTAATACCGGATATGAACTGCCTTTAGTGTGGTGGTTGGAAAGTTTTTTCGGTGCAAGATGAGCTCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGTACGGCCACATTGGGACTGAGATACGGCCCAG
## 3 GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGGAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAG
## 4 GATGAACGCTGGCGGCATGCCTAATACATGCAAGTCGAACGAAGAGGCCCTTAGACTTGGTGCTTGCACTAAAGACAAGGATAACCTCTTAGTGGCGAACGGGTGAGTAACACGTAGGTAACCTACAACTAAGACGAGGACAACAGTTGGAAACGACTGCTAATACTGGATAGGAATTAAGATTGCATATTCTTGATTTTAAAGTAGCTTCTGGCTACACTTAGAGATGGACCTGCGGCGCATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACATTGGGACTGAGACACGGCCCAA
## 5 GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCCTGCAGCTTGCTGCGGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTCAACTTTGGGATAAGCCTGGGAAACTGGGTCTAATACCGGATATGACCTTTCTTTAGTGTGGAAGGTGGAAAGCTTTTGCGGTTGGGGATGAGCTCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGTACGGCCACATTGGGACTGAGATACGGCCCAG
## 6 GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGAAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAG
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 3247 2 2 172 0 0 1 TRUE
## 2 2492 4 7 194 0 0 1 TRUE
## 3 1925 3 6 168 0 0 1 TRUE
## 4 1846 1 5 164 0 0 1 TRUE
## 5 1790 5 4 192 0 0 2 TRUE
## 6 1646 6 1 168 0 0 2 TRUE
Considerations: Most of the reads should successfully merge. If that is not the case, upstream parameters may need to be revisited: did we trim away the overlap between the reads?
We can now construct an amplicon sequence variant (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.
seqtab <- makeSequenceTable(mergers)
The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.
dim(seqtab)
## [1] 74 6252
Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab))) %>%
as.data.frame() %>%
rename("sequence length" = Var1) %>%
datatable(options = list(
columnDefs = list(list(className = 'dt-left', targets = c(0:2)))
))
Plot sequence length distribution
seqLen <- nchar(getSequences(seqtab)) %>%
as.data.frame() %>%
rename(seqLen = ".")
ggplot(seqLen, aes(x = seqLen)) +
geom_histogram(binwidth = 1, alpha = 0.2, position = "identity", colour = "red") +
geom_vline(aes(xintercept = mean(seqLen)), color = "blue", linetype = "dashed", size = 0.5) +
scale_x_continuous(breaks = seq(round_any(min(seqLen), 10),
round_any(max(seqLen)*1.05, 10, f = ceiling),
round_any((max(seqLen)-min(seqLen))/10, 10, f = ceiling))) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05)),
breaks = seq(0,
round_any(max(table(seqLen))*1.05, 10, f = ceiling),
round_any(max(table(seqLen))/10, 10, f = ceiling))) +
labs(x = "sequence length (bp)", title = "Amplicon length distribution") +
annotate("text", label = "mean length", x = mean(seqLen$seqLen)-2,
y = round_any(max(table(seqLen)), 10, f = ceiling), hjust = 1, colour = "blue") +
theme_bw()
Considerations: Sequences that are much longer or shorter than expected may be the result of non-specific priming. The sequence lengths fall within the range of the expected amplicon sizes. We’ll just leave them as they are.
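We keep all lengths here, but if off-target lengths were present, variants could be removed by subsetting the sequence table on the length of its column names (which are the sequences themselves). A sketch; the filter_seq_lengths helper and the 300:340 window are illustrative, not part of this pipeline:

```r
# Keep only sequence variants whose length falls in an expected window.
# DADA2 sequence tables name columns by the sequence, so nchar() on the
# column names gives amplicon lengths. The 300:340 bp window is an
# assumed example, not a recommendation for this dataset.
filter_seq_lengths <- function(seqtab, len_range) {
  seqtab[, nchar(colnames(seqtab)) %in% len_range, drop = FALSE]
}
```

It would be applied as seqtab <- filter_seq_lengths(seqtab, 300:340) before chimera removal.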
The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences. As we used pool = FALSE during the sample inference, we should use method = "consensus" for the chimera removal.
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE, verbose = TRUE)
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9818755
dim(seqtab.nochim)
## [1] 74 4448
The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on factors including experimental procedures and sample complexity.
Considerations: Most of the reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though). If most of the reads were removed as chimeric, upstream processing may need to be revisited. In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.
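It is worth distinguishing chimeric variants from chimeric reads. A quick arithmetic check, with values copied from the output above:

```r
# Chimeras are typically low-abundance: many variants, few reads.
variant_frac_removed <- 1 - 4448 / 6252  # ASV columns before vs after removal
read_frac_removed    <- 1 - 0.9818755    # from sum(seqtab.nochim)/sum(seqtab)
round(c(variant_frac_removed, read_frac_removed), 3)  # 0.289 and 0.018
```

So roughly 29% of sequence variants but under 2% of reads were flagged as chimeric, consistent with chimeras being rare sequences.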
As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:
getN <- function(x) sum(getUniques(x))
stats <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(stats) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(stats) <- sample.names
datatable(stats)
Plot the sequences stats:
p_stats <- stats %>%
as.data.frame() %>%
rownames_to_column("SampleID") %>%
mutate_at(vars("filtered":"nonchim"), ~100*.x/input) %>%
mutate(input = 100) %>%
gather(key = "step", value = "percent", -SampleID) %>%
mutate(step = factor(step, levels = c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim"))) %>%
ggplot(aes(x = step, y = percent, color = SampleID)) +
geom_point() +
geom_line(aes(group = SampleID)) +
scale_y_continuous(breaks = 0:10*10) +
labs(x = "", y = "Reads retained (%)") +
theme_bw()
ggplotly(p_stats, tooltip = c("x", "y", "colour"))
Except for 7 samples in the lower part of the plot, 6 of which are negative controls, we kept the majority of our raw reads. Sample AqFl1-062 lost a large fraction of its raw reads during filtering, suggesting poorer read quality than the other samples.
Considerations: This is a great place to do a last sanity check. Outside of filtering, there should be no step in which a majority of reads are lost. If a majority of reads failed to merge, one may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span the amplicon. If a majority of reads were removed as chimeric, one may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.
Export feature table:
t(seqtab.nochim) %>%
make_biom() %>%
write_biom(here::here("data/intermediate/dada2/table-run2.biom"))
Export representative sequences:
uniquesToFasta(seqtab.nochim, fout = here::here("data/intermediate/dada2/rep-seqs-run2.fna"),
ids = colnames(seqtab.nochim))
The processing of raw sequence data into an ASV table is based on the online DADA2 tutorial (1.16). For more documentation and tutorials, visit the DADA2 website.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=nb_NO.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=nb_NO.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=nb_NO.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biomformat_1.20.0 plotly_4.9.4.1 DT_0.18 plyr_1.8.6
## [5] dada2_1.20.0 Rcpp_1.0.7 forcats_0.5.1 stringr_1.4.0
## [9] dplyr_1.0.7 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
## [13] tibble_3.1.2 ggplot2_3.3.5 tidyverse_1.3.1 here_1.0.1
## [17] knitr_1.33
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 hwriter_1.3.2
## [3] ellipsis_0.3.2 rprojroot_2.0.2
## [5] htmlTable_2.2.1 XVector_0.32.0
## [7] GenomicRanges_1.44.0 base64enc_0.1-3
## [9] fs_1.5.0 rstudioapi_0.13
## [11] farver_2.1.0 fansi_0.5.0
## [13] lubridate_1.7.10 xml2_1.3.2
## [15] codetools_0.2-18 splines_4.1.1
## [17] Formula_1.2-4 jsonlite_1.7.2
## [19] Rsamtools_2.8.0 broom_0.7.6
## [21] cluster_2.1.2 dbplyr_2.1.1
## [23] png_0.1-7 compiler_4.1.1
## [25] httr_1.4.2 backports_1.2.1
## [27] assertthat_0.2.1 Matrix_1.3-4
## [29] lazyeval_0.2.2 cli_3.0.0
## [31] htmltools_0.5.1.1 tools_4.1.1
## [33] gtable_0.3.0 glue_1.4.2
## [35] GenomeInfoDbData_1.2.6 reshape2_1.4.4
## [37] ShortRead_1.50.0 Biobase_2.52.0
## [39] cellranger_1.1.0 jquerylib_0.1.4
## [41] rhdf5filters_1.4.0 vctrs_0.3.8
## [43] Biostrings_2.60.1 crosstalk_1.1.1
## [45] xfun_0.24 rvest_1.0.0
## [47] lifecycle_1.0.0 zlibbioc_1.38.0
## [49] scales_1.1.1 hms_1.0.0
## [51] MatrixGenerics_1.4.0 parallel_4.1.1
## [53] SummarizedExperiment_1.22.0 rhdf5_2.36.0
## [55] RColorBrewer_1.1-2 yaml_2.2.1
## [57] gridExtra_2.3 sass_0.4.0
## [59] rpart_4.1-15 latticeExtra_0.6-29
## [61] stringi_1.6.2 highr_0.9
## [63] S4Vectors_0.30.0 checkmate_2.0.0
## [65] BiocGenerics_0.38.0 BiocParallel_1.26.1
## [67] GenomeInfoDb_1.28.1 rlang_0.4.11
## [69] pkgconfig_2.0.3 bitops_1.0-7
## [71] matrixStats_0.59.0 evaluate_0.14
## [73] lattice_0.20-45 Rhdf5lib_1.14.2
## [75] labeling_0.4.2 GenomicAlignments_1.28.0
## [77] htmlwidgets_1.5.3 tidyselect_1.1.1
## [79] magrittr_2.0.1 R6_2.5.0
## [81] IRanges_2.26.0 generics_0.1.0
## [83] Hmisc_4.5-0 DelayedArray_0.18.0
## [85] DBI_1.1.1 pillar_1.6.1
## [87] haven_2.4.1 foreign_0.8-81
## [89] withr_2.4.2 survival_3.2-13
## [91] RCurl_1.98-1.3 nnet_7.3-16
## [93] modelr_0.1.8 crayon_1.4.1
## [95] utf8_1.2.1 rmarkdown_2.9
## [97] jpeg_0.1-8.1 grid_4.1.1
## [99] readxl_1.3.1 data.table_1.14.0
## [101] reprex_2.0.0 digest_0.6.27
## [103] RcppParallel_5.1.4 stats4_4.1.1
## [105] munsell_0.5.0 viridisLite_0.4.0
## [107] bslib_0.2.5.1